Effective Training


Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning

Wang, Zedong, Li, Siyuan, Xu, Dan

arXiv.org Artificial Intelligence

Despite the promise of Multi-Task Learning in leveraging complementary knowledge across tasks, existing multi-task optimization (MTO) techniques remain fixated on resolving conflicts via optimizer-centric loss scaling and gradient manipulation strategies, yet fail to deliver consistent gains. In this paper, we argue that the shared representation space, where task interactions naturally occur, offers rich information and potential for operations complementary to existing optimizers, especially for facilitating inter-task complementarity, which is rarely explored in MTO. This intuition leads to Rep-MTL, which exploits representation-level task saliency to quantify interactions between task-specific optimization and shared representation learning. By steering these saliencies through entropy-based penalization and sample-wise cross-task alignment, Rep-MTL aims to mitigate negative transfer by maintaining the effective training of individual tasks instead of relying on pure conflict resolution, while explicitly promoting complementary information sharing. Experiments are conducted on four challenging MTL benchmarks covering both task-shift and domain-shift scenarios. The results show that Rep-MTL, even paired with the basic equal weighting policy, achieves competitive performance gains with favorable efficiency. Beyond standard performance metrics, Power Law exponent analysis demonstrates Rep-MTL's efficacy in balancing task-specific learning and cross-task sharing. The project page is available at HERE.
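As a loose illustration of the entropy term such an approach might use (the function name and the saliency definition below are assumptions for illustration, not the paper's actual formulation):

```python
import math

def entropy_penalty(saliencies):
    """Entropy of a normalized task-saliency distribution.

    `saliencies` are hypothetical non-negative per-task scores; a penalty
    built on this entropy can push the distribution toward (or away from)
    uniformity, depending on its sign in the training objective.
    """
    total = sum(saliencies)
    probs = [s / total for s in saliencies]
    return -sum(p * math.log(p) for p in probs if p > 0)

# A uniform saliency distribution has maximal entropy;
# a skewed one (one task dominating) has lower entropy.
balanced = entropy_penalty([1.0, 1.0, 1.0])
skewed = entropy_penalty([10.0, 0.5, 0.5])
```

Here the entropy gap between `balanced` and `skewed` is what a regularizer could act on to keep every task's training signal alive.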


Building on Efficient Foundations: Effective Training of LLMs with Structured Feedforward Layers

Neural Information Processing Systems

State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive. This has sparked a research agenda to reduce these models' parameter counts and computational costs without significantly impacting their performance. Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFNs), which are less studied than attention blocks. We consider three structured linear parameterizations of the FFN using efficient low-rank and block-diagonal matrices. In contrast to many previous works that examined these approximations, our study i) explores these structures from a training-from-scratch perspective, ii) scales up to 1.3B parameters, and iii) is conducted within recent Transformer-based LLMs rather than convolutional architectures. We demonstrate that these structures can lead to actual computational gains in various scenarios, including online decoding when using a pre-merge technique.
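A rough sketch of the parameter arithmetic behind one such structure, a low-rank factorization of an FFN projection (the dimensions below are illustrative assumptions, not the paper's configurations):

```python
# Replace a dense FFN up-projection W (d_model x d_ff) with a low-rank
# product U @ V, where U is d_model x rank and V is rank x d_ff.
d_model, d_ff, rank = 512, 2048, 64

dense_params = d_model * d_ff             # parameters in the dense matrix
lowrank_params = rank * (d_model + d_ff)  # parameters in the two factors

reduction = dense_params / lowrank_params  # ~6.4x fewer parameters here
```

The same factorization also replaces one wide matmul with two thin ones, which is where the compute savings come from when `rank` is small relative to `d_model` and `d_ff`.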


Review for NeurIPS paper: SuperLoss: A Generic Loss for Robust Curriculum Learning

Neural Information Processing Systems

Additional Feedback: Further comments: - The definition of hard and easy examples is limited to their respective confidence scores or losses. Although previous work has similar definitions, confidence or loss are not always good indicators of the true easiness or hardness of samples; e.g., they could be erroneous at early iterations. The paper lacks an experiment that illustrates the validity of the above definition. These are probably hard or noisy examples that were mistreated as easy examples by the model. These are probably a mixture of easy, hard, and noisy examples with low confidence across the loss spectrum that were mistreated as hard examples by the model.


S³: Sign-Sparse-Shift Reparametrization for Effective Training of Low-bit Shift Networks

Neural Information Processing Systems

Shift neural networks reduce computational complexity by removing expensive multiplication operations and quantizing continuous weights into low-bit discrete values, making them fast and energy-efficient compared to conventional neural networks. However, existing shift networks are sensitive to weight initialization and yield degraded performance caused by the vanishing gradient and weight sign freezing problems. To address these issues, we propose S³ re-parameterization, a novel technique for training low-bit shift networks. In this way, the network efficiently learns a low-bit representation with weight dynamics similar to those of full-precision networks, and it is insensitive to weight initialization. Our proposed training method pushes the boundaries of shift neural networks and shows that 3-bit shift networks can compete with their full-precision counterparts in terms of top-1 accuracy on ImageNet.
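To make the weight format concrete, here is a minimal sketch of snapping a continuous weight to the sign-times-power-of-two form that shift networks use. This is a plain rounding baseline for illustration only; the paper's S³ method instead learns sign, sparsity, and shift as separate reparameterized factors, and the exponent range below is an assumption:

```python
import math

def quantize_to_shift(w, min_exp=-3, max_exp=0):
    """Snap a continuous weight to sign * 2**p (the shift-network format).

    Shift networks keep only a sign and a small integer exponent per
    weight, so multiplications become cheap bit-shifts in hardware.
    """
    if w == 0:
        return 0.0
    exp = round(math.log2(abs(w)))        # nearest power-of-two exponent
    exp = max(min_exp, min(max_exp, exp)) # clamp to the low-bit range
    return math.copysign(2.0 ** exp, w)
```

Naive rounding like this is exactly what makes training fragile (sign flips and vanishing gradients through the quantizer), which is the failure mode the reparameterization targets.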


CompeteSMoE -- Effective Training of Sparse Mixture of Experts via Competition

Pham, Quang, Do, Giang, Nguyen, Huy, Nguyen, TrungTin, Liu, Chenghao, Sartipi, Mina, Nguyen, Binh T., Ramasamy, Savitha, Li, Xiaoli, Hoi, Steven, Ho, Nhat

arXiv.org Artificial Intelligence

Sparse mixture of experts (SMoE) offers an appealing solution to scale up model complexity beyond merely increasing the network's depth or width. However, effective training of SMoE has proven to be challenging due to the representation collapse issue, which causes parameter redundancy and limited representation potential. In this work, we propose a competition mechanism to address this fundamental challenge of representation collapse. By routing inputs only to experts with the highest neural response, we show that, under mild assumptions, competition enjoys the same convergence rate as the optimal estimator. We further propose CompeteSMoE, an effective and efficient algorithm to train large language models by deploying a simple router that predicts the competition outcomes. Consequently, CompeteSMoE enjoys strong performance gains from the competition routing policy while having low computation overheads. Our extensive empirical evaluations on two transformer architectures and a wide range of tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE compared to state-of-the-art SMoE strategies.
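A minimal sketch of the competition idea, routing to the experts with the largest response (the function and the scalar-response setup are illustrative assumptions; a real SMoE router scores token embeddings per expert):

```python
def compete_route(expert_responses, top_k=1):
    """Return indices of the expert(s) with the largest response magnitude.

    `expert_responses` is a hypothetical list of scalar neural responses,
    one per expert, for a single input.
    """
    ranked = sorted(range(len(expert_responses)),
                    key=lambda i: abs(expert_responses[i]),
                    reverse=True)
    return ranked[:top_k]
```

The trained router in CompeteSMoE would then try to predict these winners cheaply, without evaluating every expert on every input.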



MicroBERT: Effective Training of Low-resource Monolingual BERTs through Parameter Reduction and Multitask Learning

Gessler, Luke, Zeldes, Amir

arXiv.org Artificial Intelligence

Transformer language models (TLMs) are critical for most NLP tasks, but they are difficult to create for low-resource languages because of how much pretraining data they require. In this work, we investigate two techniques for training monolingual TLMs in a low-resource setting: greatly reducing TLM size, and complementing the masked language modeling objective with two linguistically rich supervised tasks (part-of-speech tagging and dependency parsing). Results from 7 diverse languages indicate that our model, MicroBERT, is able to produce marked improvements in downstream task evaluations relative to a typical monolingual TLM pretraining approach. Specifically, we find that monolingual MicroBERT models achieve gains of up to 18% for parser LAS and 11% for NER F1 compared to a multilingual baseline, mBERT, while having less than 1% of its parameter count. We conclude that reducing TLM parameter count and using labeled data for pretraining low-resource TLMs can yield large quality benefits and, in some cases, produce models that outperform multilingual approaches.
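The multitask objective can be pictured as a simple weighted sum of the masked-LM loss and the two auxiliary supervised losses. The coefficient below is an assumption for illustration; the abstract does not specify how the objectives are weighted:

```python
def multitask_loss(mlm_loss, pos_loss, parse_loss, aux_weight=0.5):
    """Combine masked language modeling with auxiliary supervised tasks.

    `aux_weight` is a hypothetical coefficient on the part-of-speech
    tagging and dependency parsing losses.
    """
    return mlm_loss + aux_weight * (pos_loss + parse_loss)
```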


Effective Training of a Neural Network Character Classifier for Word Recognition

Yaeger, Larry S., Lyon, Richard F., Webb, Brandyn J.

Neural Information Processing Systems

We have combined an artificial neural network (ANN) character classifier with context-driven search over character segmentation, word segmentation, and word recognition hypotheses to provide robust recognition of hand-printed English text in new models of Apple Computer's Newton MessagePad. We present some innovations in the training and use of ANNs as character classifiers for word recognition, including normalized output error, frequency balancing, error emphasis, negative training, and stroke warping. A recurring theme of reducing a priori biases emerges and is discussed.


Effective Training of a Neural Network Character Classifier for Word Recognition

Yaeger, Larry S., Lyon, Richard F., Webb, Brandyn J.

Neural Information Processing Systems

We have been conducting research on bottom-up classification techniques based on trainable artificial neural networks (ANNs), in combination with comprehensive but weakly-applied language models. To focus our work on a subproblem that is tractable enough to lead to usable products in a reasonable time, we have restricted the domain to hand-printing, so that strokes are clearly delineated by pen lifts. In the process of optimizing overall performance of the recognizer, we have discovered some useful techniques for architecting and training ANNs that must participate in a larger recognition process. Some of these techniques, especially the normalization of output error, frequency balancing, and error emphasis, suggest a common theme of significant value derived by reducing the effect of a priori biases in training data to better represent low frequency, low probability samples, including second and third choice probabilities. There is ample prior work in combining low-level classifiers with various search strategies to provide integrated segmentation and recognition for writing (Tappert et al 1990) and speech (Renals et al 1992). And there is a rich background in the use of ANNs as classifiers, including their use as a low-level character classifier in a higher-level word recognition system (Bengio et al 1995).
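One of the named techniques, frequency balancing, amounts to reweighting classes so that rare characters are not swamped by common ones. A minimal sketch, assuming inverse-power weighting (the fractional exponent is an assumption chosen to only partially flatten the label distribution, not the paper's exact scheme):

```python
from collections import Counter

def frequency_balanced_weights(labels, power=0.5):
    """Per-class sampling weights proportional to count**(-power).

    power=1.0 would fully equalize classes; power=0.0 leaves the natural
    (biased) frequencies untouched.
    """
    counts = Counter(labels)
    return {cls: counts[cls] ** -power for cls in counts}

# Rare 'q' gets a larger weight than frequent 'e'.
w = frequency_balanced_weights(['e'] * 100 + ['q'] * 4)
```

Sampling (or scaling per-example error) by these weights is one way to reduce the a priori frequency bias the authors describe.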

